Overview

What meaningful measures of engagement can S2M use to predict longer tenure on the app? If we can define behavioral patterns predictive of long-term usage, S2M can identify the users most likely to be shorter-term users and target them with nudges to prolong or increase their engagement. Through the descriptive and cohort analyses, we discovered that there is no notable difference between short- and longer-term users with respect to the number of stories viewed on the first day, the number of types of story, and which app feature is used to access stories. As seen in Figure YY, the rate of decrease in app usage over time is similar across groups of users with varying levels of first-week usage. However, further analysis uncovered that weeks 8-10 are an important inflection point for user churn. To develop a successful nudging strategy, it is therefore necessary to identify the users who are likely to have less than 10 weeks of tenure.

Preliminaries

We will use the output created from the R Markdown file titled master_clustering.Rmd.

In this section, we define the response variable we want to predict and select the variables we want as predictors. For S2M to predict a different output or select different inputs, only the values within the quotation marks (" ") need to be changed.
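A minimal sketch of this selection step. The data frame name `cluster_output` and the response column `n_days_freadom` are illustrative assumptions (the predictor names are taken from the model output later in this section); substitute the actual names produced by master_clustering.Rmd inside the quotes:

```r
# Toy stand-in for the data frame produced by master_clustering.Rmd;
# in practice this object comes from running that file first.
cluster_output <- data.frame(
  n_days_freadom                    = c(30, 90, 150),   # hypothetical response column
  n_sessions_first3wks              = c(5, 12, 20),
  avg_n_stories_first3wkly_sessions = c(1.2, 2.5, 3.1),
  unique_sources_first3wks          = c(1, 2, 3)
)

# Change the values inside the quotes to predict a different output
# or to select different inputs.
response_name   <- "n_days_freadom"
predictor_names <- c("n_sessions_first3wks",
                     "avg_n_stories_first3wkly_sessions",
                     "unique_sources_first3wks")

Train_Y <- cluster_output[, response_name, drop = FALSE]  # response
Train_X <- cluster_output[, predictor_names]              # features
```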

Prediction Models

Through the clustering and data analysis, we identified the Average Number of Stories, Number of Sessions, and Unique Sources as the main predictors driving long-term tenure; these served as our features, with the Number of Days in Freadom since Signup as the response. The time window used to compute the features was left as a hyperparameter to tune.

We developed a supervised learning model to predict tenure. To evaluate the performance of different approaches, we used the Root Mean Squared Error (RMSE) as our metric, which measures how far, on average, the predictions are from the true values.
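The evaluation metric used throughout this section can be written as a small helper; this is a generic sketch, not the report's exact code:

```r
# Root Mean Squared Error: the average distance between
# predictions and true values, in the units of the response.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Small worked example:
rmse(c(10, 20, 30), c(12, 18, 33))  # sqrt((4 + 4 + 9) / 3)
```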

Tree Based Methods

Best Performing Model: Bagging

We use the randomForest library to fit a Regression Tree Bootstrap Aggregation (Bagging) model by specifying that all variables (mtry = 3) are considered at each split. We plot the out-of-bag (OOB) error to see how the prediction error depends on the number of trees grown.

## 
## Call:
##  randomForest(formula = child_Train_Y[, ] ~ ., data = child_Train_X,      mtry = 3, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 3198.009
##                     % Var explained: -2.58

We then run predictions on the test set and plot the True Values vs. the Predictions. The closer the points lie to an ascending 45-degree line, the lower the bias.

## [1] "The Train RMSE of the Bagging model is: 47.6010917162449"
## [1] "The Test RMSE of the Bagging model is: 57.1622189482128"

To counter the high bias seen in the Linear Models, flexible methods such as tree-based models can work. Individual trees, however, suffer from high variance. To decrease the variance while maintaining low bias, a tree-averaging method such as Bagging (Bootstrap Aggregation) can achieve both. To decrease variance without collecting additional data, Bagging uses a powerful statistical device known as bootstrapping: by taking repeated samples with replacement from the original set of observations, we synthetically obtain B samples and train a statistical model on each of them. We then average the predictions of all B models to obtain a final prediction.
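The bootstrap-and-average idea can be illustrated in a few lines of base R. This is a conceptual sketch on simulated data, with a linear model as the base learner for brevity; the report's Bagging model uses regression trees via randomForest with mtry set to all predictors:

```r
set.seed(1)

# Simulated data standing in for the training set
n <- 200
d <- data.frame(x = runif(n))
d$y <- 3 * d$x + rnorm(n, sd = 0.5)

# Bootstrap Aggregation: train B models on resamples, average predictions
B <- 100
preds <- replicate(B, {
  idx <- sample(n, replace = TRUE)      # bootstrap sample (with replacement)
  fit <- lm(y ~ x, data = d[idx, ])     # train on the resample
  predict(fit, newdata = d)             # predict on the full set
})

bagged <- rowMeans(preds)               # average over the B models
```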

Based on the initial analysis we performed, S2M should use the Bagging method. In case S2M wants to explore further, below we present other methods we used.

Regression Tree

## 
## Regression tree:
## tree(formula = child_Train_Y[, ] ~ ., data = child_Train_X)
## Variables actually used in tree construction:
## [1] "n_sessions_first3wks"              "avg_n_stories_first3wkly_sessions"
## Number of terminal nodes:  4 
## Residual mean deviance:  2933 = 31050000 / 10590 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -71.50  -35.36  -22.36    0.00   19.87  221.60

Plot of the full tree.

Evaluating the CV error of the pruned trees and plotting their deviance (roughly interpreted as the sum of squared error):

##        Length Class  Mode     
## size   3      -none- numeric  
## dev    3      -none- numeric  
## k      3      -none- numeric  
## method 1      -none- character

Computing the RMSE of the full tree

## [1] "The Train RMSE of the full tree is: 54.1445693427257"
## [1] "The Test RMSE of the full tree is : 55.8302830049567"
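The fit-prune-evaluate workflow above can be sketched end to end. This is a minimal sketch on simulated data using rpart (which ships with R) in place of the tree package shown in the output; the column names and the data-generating formula are illustrative:

```r
set.seed(1)
library(rpart)  # bundled with R; the report itself uses the `tree` package

# Simulated data standing in for the training set
n <- 500
d <- data.frame(
  n_sessions_first3wks              = rpois(n, 5),
  avg_n_stories_first3wkly_sessions = runif(n, 1, 4)
)
d$tenure <- 10 * d$n_sessions_first3wks + rnorm(n, sd = 20)

full_tree <- rpart(tenure ~ ., data = d)   # fit the full regression tree

# rpart cross-validates internally; pick the complexity parameter
# minimising cross-validated error, then prune back to it
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(full_tree, cp = best_cp)

preds     <- predict(pruned, newdata = d)
tree_rmse <- sqrt(mean((d$tenure - preds)^2))
```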

Random Forest

##                 Length Class  Mode     
## call                5  -none- call     
## type                1  -none- character
## predicted       10590  -none- numeric  
## mse               500  -none- numeric  
## rsq               500  -none- numeric  
## oob.times       10590  -none- numeric  
## importance          6  -none- numeric  
## importanceSD        3  -none- numeric  
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             11  -none- list     
## coefs               0  -none- NULL     
## y               10590  -none- numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call

Calculating the RMSE

## [1] "The Train RMSE of the Random Forest model is: 48.3819323196901"
## [1] "The Test RMSE of the Random Forest model is : 56.0545478170325"

Linear Methods and Extensions

Lasso Shrinkage

## 3 x 1 sparse Matrix of class "dgCMatrix"
##                                          s0
## unique_sources_first3wks          -27.56393
## n_sessions_first3wks               52.58239
## avg_n_stories_first3wkly_sessions  90.44266

Calculating the RMSE

## [1] "The Train RMSE of the Lasso Model is: 55.1853787164738"
## [1] "The Test RMSE of the Lasso Model is : 56.9500496048001"

Summary of All RMSEs

## [1] "The Train RMSE of the Linear Regression : 55.1853499149659"
## [1] "The Train RMSE of the Lasso             : 55.1853787164738"
## [1] "The Train RMSE of the Decision Tree     : 54.1445693427257"
## [1] "The Train RMSE of the Boosting Model    : 53.4835401890399"
## [1] "The Train RMSE of the Bagging Model     : 47.6010917162449"
## [1] "The Train RMSE of the Random Forest     : 48.3819323196901"
## [1] "The Test RMSE of the Linear Regression : 56.9501433881364"
## [1] "The Test RMSE of the Lasso             : 56.9500496048001"
## [1] "The Test RMSE of the Decision Tree     : 55.8302830049567"
## [1] "The Test RMSE of the Boosting Model    : 55.0622018237479"
## [1] "The Test RMSE of the Bagging Model     : 57.1622189482128"
## [1] "The Test RMSE of the Random Forest     : 56.0545478170325"

Implementation Pipeline

Our proposal and implementation pipeline for S2M from this code file and model is as follows:

  • Step 1: Run master_clustering.Rmd
  • Step 2: Use master_prediction.Rmd to identify those users likely to stay for less than 10 weeks
  • Step 3: Nudge users below 10 weeks
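The three steps above could be driven from a single script. This is a hypothetical driver sketch, not part of the report's code; it assumes both .Rmd files sit in the working directory, and the 70-day threshold simply restates 10 weeks in days:

```r
# Step 1: build the clustering features (produces the objects
# that master_prediction.Rmd consumes)
rmarkdown::render("master_clustering.Rmd")

# Step 2: fit the Bagging model and predict tenure for current users
rmarkdown::render("master_prediction.Rmd")

# Step 3: users with predicted tenure below 10 weeks (70 days) are the
# nudging targets; the prediction file's output list feeds the campaign.
```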